{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The usual preamble\n",
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "# Make the graphs a bit prettier, and bigger\n",
    "plt.style.use('ggplot')\n",
    "plt.rcParams['figure.figsize'] = (15, 5)\n",
    "plt.rcParams['font.family'] = 'sans-serif'\n",
    "\n",
    "# This is necessary to show lots of columns in pandas 0.12. \n",
    "# Not necessary in pandas 0.13.\n",
    "pd.set_option('display.width', 5000) \n",
    "pd.set_option('display.max_columns', 60)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the main problems with messy data is: how do you know if it's messy or not?\n",
    "\n",
    "We're going to use the NYC 311 service request dataset again here, since it's big and a bit unwieldy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [],
   "source": [
    "requests = pd.read_csv('../data/311-service-requests.csv', dtype='unicode')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7.1 How do we know if it's messy? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're going to look at a few columns here. I know already that there are some problems with the zip code, so let's look at that first.\n",
    " \n",
    "To get a sense for whether a column has problems, I usually use `.unique()` to look at all its values. If it's a numeric column, I'll instead plot a histogram to get a sense of the distribution.\n",
    "\n",
    "When we look at the unique values in \"Incident Zip\", it quickly becomes clear that this is a mess.\n",
    "\n",
    "Some of the problems:\n",
    "\n",
    "* Some have been parsed as strings, and some as floats\n",
    "* There are `nan`s \n",
    "* Some of the zip codes are `29616-0759` or `83`\n",
    "* There are some N/A values that pandas didn't recognize, like 'N/A' and 'NO CLUE'\n",
    "\n",
    "What we can do:\n",
    "\n",
    "* Normalize 'N/A' and 'NO CLUE' into regular nan values\n",
    "* Look at what's up with the 83, and decide what to do\n",
    "* Make everything strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',\n",
       "       '11417', '10011', '11225', '11218', '10003', '10029', '10466',\n",
       "       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',\n",
       "       '10305', '10312', '10026', '10309', '10036', '11433', '11235',\n",
       "       '11213', '11379', '11101', '10014', '11231', '11234', '10457',\n",
       "       '10459', '10465', '11207', '10002', '10034', '11233', '10453',\n",
       "       '10456', '10469', '11374', '11221', '11421', '11215', '10007',\n",
       "       '10019', '11205', '11418', '11369', '11249', '10005', '10009',\n",
       "       '11211', '11412', '10458', '11229', '10065', '10030', '11222',\n",
       "       '10024', '10013', '11420', '11365', '10012', '11214', '11212',\n",
       "       '10022', '11232', '11040', '11226', '10281', '11102', '11208',\n",
       "       '10001', '10472', '11414', '11223', '10040', '11220', '11373',\n",
       "       '11203', '11691', '11356', '10017', '10452', '10280', '11217',\n",
       "       '10031', '11201', '11358', '10128', '11423', '10039', '10010',\n",
       "       '11209', '10021', '10037', '11413', '11375', '11238', '10473',\n",
       "       '11103', '11354', '11361', '11106', '11385', '10463', '10467',\n",
       "       '11204', '11237', '11377', '11364', '11434', '11435', '11210',\n",
       "       '11228', '11368', '11694', '10464', '11415', '10314', '10301',\n",
       "       '10018', '10038', '11105', '11230', '10468', '11104', '10471',\n",
       "       '11416', '10075', '11422', '11355', '10028', '10462', '10306',\n",
       "       '10461', '11224', '11429', '10035', '11366', '11362', '11206',\n",
       "       '10460', '10304', '11360', '11411', '10455', '10475', '10069',\n",
       "       '10303', '10308', '10302', '11357', '10470', '11367', '11370',\n",
       "       '10454', '10451', '11436', '11426', '10153', '11004', '11428',\n",
       "       '11427', '11001', '11363', '10004', '10474', '11430', '10000',\n",
       "       '10307', '11239', '10119', '10006', '10048', '11697', '11692',\n",
       "       '11693', '10573', '00083', '11559', '10020', '77056', '11776',\n",
       "       '70711', '10282', '11109', '10044', '02061', '77092-2016', '14225',\n",
       "       '55164-0737', '19711', '07306', '000000', 'NO CLUE', '90010',\n",
       "       '11747', '23541', '11788', '07604', '10112', '11563', '11580',\n",
       "       '07087', '11042', '07093', '11501', '92123', '00000', '11575',\n",
       "       '07109', '11797', '10803', '11716', '11722', '11549-3650', '10162',\n",
       "       '23502', '11518', '07020', '08807', '11577', '07114', '11003',\n",
       "       '07201', '61702', '10103', '29616-0759', '35209-3114', '11520',\n",
       "       '11735', '10129', '11005', '41042', '11590', '06901', '07208',\n",
       "       '11530', '13221', '10954', '11111', '10107'], dtype=object)"
      ]
     },
     "execution_count": 115,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "requests['Incident Zip'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7.2 Fixing the nan values and string/float confusion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can pass a `na_values` option to `pd.read_csv` to clean this up a little bit. We can also specify that the type of Incident Zip is a string, not a float."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [],
   "source": [
    "na_values = ['NO CLUE', 'N/A', '0']\n",
    "requests = pd.read_csv('../data/311-service-requests.csv', na_values=na_values, dtype={'Incident Zip': str})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',\n",
       "       '11417', '10011', '11225', '11218', '10003', '10029', '10466',\n",
       "       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',\n",
       "       '10305', '10312', '10026', '10309', '10036', '11433', '11235',\n",
       "       '11213', '11379', '11101', '10014', '11231', '11234', '10457',\n",
       "       '10459', '10465', '11207', '10002', '10034', '11233', '10453',\n",
       "       '10456', '10469', '11374', '11221', '11421', '11215', '10007',\n",
       "       '10019', '11205', '11418', '11369', '11249', '10005', '10009',\n",
       "       '11211', '11412', '10458', '11229', '10065', '10030', '11222',\n",
       "       '10024', '10013', '11420', '11365', '10012', '11214', '11212',\n",
       "       '10022', '11232', '11040', '11226', '10281', '11102', '11208',\n",
       "       '10001', '10472', '11414', '11223', '10040', '11220', '11373',\n",
       "       '11203', '11691', '11356', '10017', '10452', '10280', '11217',\n",
       "       '10031', '11201', '11358', '10128', '11423', '10039', '10010',\n",
       "       '11209', '10021', '10037', '11413', '11375', '11238', '10473',\n",
       "       '11103', '11354', '11361', '11106', '11385', '10463', '10467',\n",
       "       '11204', '11237', '11377', '11364', '11434', '11435', '11210',\n",
       "       '11228', '11368', '11694', '10464', '11415', '10314', '10301',\n",
       "       '10018', '10038', '11105', '11230', '10468', '11104', '10471',\n",
       "       '11416', '10075', '11422', '11355', '10028', '10462', '10306',\n",
       "       '10461', '11224', '11429', '10035', '11366', '11362', '11206',\n",
       "       '10460', '10304', '11360', '11411', '10455', '10475', '10069',\n",
       "       '10303', '10308', '10302', '11357', '10470', '11367', '11370',\n",
       "       '10454', '10451', '11436', '11426', '10153', '11004', '11428',\n",
       "       '11427', '11001', '11363', '10004', '10474', '11430', '10000',\n",
       "       '10307', '11239', '10119', '10006', '10048', '11697', '11692',\n",
       "       '11693', '10573', '00083', '11559', '10020', '77056', '11776',\n",
       "       '70711', '10282', '11109', '10044', '02061', '77092-2016', '14225',\n",
       "       '55164-0737', '19711', '07306', '000000', '90010', '11747',\n",
       "       '23541', '11788', '07604', '10112', '11563', '11580', '07087',\n",
       "       '11042', '07093', '11501', '92123', '00000', '11575', '07109',\n",
       "       '11797', '10803', '11716', '11722', '11549-3650', '10162', '23502',\n",
       "       '11518', '07020', '08807', '11577', '07114', '11003', '07201',\n",
       "       '61702', '10103', '29616-0759', '35209-3114', '11520', '11735',\n",
       "       '10129', '11005', '41042', '11590', '06901', '07208', '11530',\n",
       "       '13221', '10954', '11111', '10107'], dtype=object)"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "requests['Incident Zip'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7.3 What's up with the dashes?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 118,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows_with_dashes = requests['Incident Zip'].str.contains('-').fillna(False)\n",
    "len(requests[rows_with_dashes])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unique Key</th>\n",
       "      <th>Created Date</th>\n",
       "      <th>Closed Date</th>\n",
       "      <th>Agency</th>\n",
       "      <th>Agency Name</th>\n",
       "      <th>Complaint Type</th>\n",
       "      <th>Descriptor</th>\n",
       "      <th>Location Type</th>\n",
       "      <th>Incident Zip</th>\n",
       "      <th>Incident Address</th>\n",
       "      <th>Street Name</th>\n",
       "      <th>Cross Street 1</th>\n",
       "      <th>Cross Street 2</th>\n",
       "      <th>Intersection Street 1</th>\n",
       "      <th>Intersection Street 2</th>\n",
       "      <th>Address Type</th>\n",
       "      <th>City</th>\n",
       "      <th>Landmark</th>\n",
       "      <th>Facility Type</th>\n",
       "      <th>Status</th>\n",
       "      <th>Due Date</th>\n",
       "      <th>Resolution Action Updated Date</th>\n",
       "      <th>Community Board</th>\n",
       "      <th>Borough</th>\n",
       "      <th>X Coordinate (State Plane)</th>\n",
       "      <th>Y Coordinate (State Plane)</th>\n",
       "      <th>Park Facility Name</th>\n",
       "      <th>Park Borough</th>\n",
       "      <th>School Name</th>\n",
       "      <th>School Number</th>\n",
       "      <th>School Region</th>\n",
       "      <th>School Code</th>\n",
       "      <th>School Phone Number</th>\n",
       "      <th>School Address</th>\n",
       "      <th>School City</th>\n",
       "      <th>School State</th>\n",
       "      <th>School Zip</th>\n",
       "      <th>School Not Found</th>\n",
       "      <th>School or Citywide Complaint</th>\n",
       "      <th>Vehicle Type</th>\n",
       "      <th>Taxi Company Borough</th>\n",
       "      <th>Taxi Pick Up Location</th>\n",
       "      <th>Bridge Highway Name</th>\n",
       "      <th>Bridge Highway Direction</th>\n",
       "      <th>Road Ramp</th>\n",
       "      <th>Bridge Highway Segment</th>\n",
       "      <th>Garage Lot Name</th>\n",
       "      <th>Ferry Direction</th>\n",
       "      <th>Ferry Terminal Name</th>\n",
       "      <th>Latitude</th>\n",
       "      <th>Longitude</th>\n",
       "      <th>Location</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>29136</th>\n",
       "      <td>26550551</td>\n",
       "      <td>10/24/2013 06:16:34 PM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>DCA</td>\n",
       "      <td>Department of Consumer Affairs</td>\n",
       "      <td>Consumer Complaint</td>\n",
       "      <td>False Advertising</td>\n",
       "      <td>NaN</td>\n",
       "      <td>77092-2016</td>\n",
       "      <td>2700 EAST SELTICE WAY</td>\n",
       "      <td>EAST SELTICE WAY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>HOUSTON</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Assigned</td>\n",
       "      <td>11/13/2013 11:15:20 AM</td>\n",
       "      <td>10/29/2013 11:16:16 AM</td>\n",
       "      <td>0 Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30939</th>\n",
       "      <td>26548831</td>\n",
       "      <td>10/24/2013 09:35:10 AM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>DCA</td>\n",
       "      <td>Department of Consumer Affairs</td>\n",
       "      <td>Consumer Complaint</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>NaN</td>\n",
       "      <td>55164-0737</td>\n",
       "      <td>P.O. BOX 64437</td>\n",
       "      <td>64437</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>ST. PAUL</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Assigned</td>\n",
       "      <td>11/13/2013 02:30:21 PM</td>\n",
       "      <td>10/29/2013 02:31:06 PM</td>\n",
       "      <td>0 Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>70539</th>\n",
       "      <td>26488417</td>\n",
       "      <td>10/15/2013 03:40:33 PM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>TLC</td>\n",
       "      <td>Taxi and Limousine Commission</td>\n",
       "      <td>Taxi Complaint</td>\n",
       "      <td>Driver Complaint</td>\n",
       "      <td>Street</td>\n",
       "      <td>11549-3650</td>\n",
       "      <td>365 HOFSTRA UNIVERSITY</td>\n",
       "      <td>HOFSTRA UNIVERSITY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>HEMSTEAD</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Assigned</td>\n",
       "      <td>11/30/2013 01:20:33 PM</td>\n",
       "      <td>10/16/2013 01:21:39 PM</td>\n",
       "      <td>0 Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>La Guardia Airport</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85821</th>\n",
       "      <td>26468296</td>\n",
       "      <td>10/10/2013 12:36:43 PM</td>\n",
       "      <td>10/26/2013 01:07:07 AM</td>\n",
       "      <td>DCA</td>\n",
       "      <td>Department of Consumer Affairs</td>\n",
       "      <td>Consumer Complaint</td>\n",
       "      <td>Debt Not Owed</td>\n",
       "      <td>NaN</td>\n",
       "      <td>29616-0759</td>\n",
       "      <td>PO BOX 25759</td>\n",
       "      <td>BOX 25759</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>GREENVILLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Closed</td>\n",
       "      <td>10/26/2013 09:20:28 AM</td>\n",
       "      <td>10/26/2013 01:07:07 AM</td>\n",
       "      <td>0 Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>89304</th>\n",
       "      <td>26461137</td>\n",
       "      <td>10/09/2013 05:23:46 PM</td>\n",
       "      <td>10/25/2013 01:06:41 AM</td>\n",
       "      <td>DCA</td>\n",
       "      <td>Department of Consumer Affairs</td>\n",
       "      <td>Consumer Complaint</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35209-3114</td>\n",
       "      <td>600 BEACON PKWY</td>\n",
       "      <td>BEACON PKWY</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>BIRMINGHAM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Closed</td>\n",
       "      <td>10/25/2013 02:43:42 PM</td>\n",
       "      <td>10/25/2013 01:06:41 AM</td>\n",
       "      <td>0 Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Unique Key            Created Date             Closed Date Agency                     Agency Name      Complaint Type         Descriptor Location Type Incident Zip        Incident Address         Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type        City Landmark Facility Type    Status                Due Date Resolution Action Updated Date Community Board      Borough  X Coordinate (State Plane)  Y Coordinate (State Plane) Park Facility Name Park Borough  School Name School Number School Region  School Code School Phone Number School Address  School City School State   School Zip School Not Found  School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name  Latitude  Longitude Location\n",
       "29136    26550551  10/24/2013 06:16:34 PM                     NaN    DCA  Department of Consumer Affairs  Consumer Complaint  False Advertising           NaN   77092-2016   2700 EAST SELTICE WAY    EAST SELTICE WAY            NaN            NaN                   NaN                   NaN          NaN     HOUSTON      NaN           NaN  Assigned  11/13/2013 11:15:20 AM         10/29/2013 11:16:16 AM   0 Unspecified  Unspecified                         NaN                         NaN        Unspecified  Unspecified  Unspecified   Unspecified   Unspecified  Unspecified         Unspecified    Unspecified  Unspecified  Unspecified  Unspecified                N                           NaN          NaN                  NaN                   NaN                 NaN                      NaN       NaN                    NaN             NaN             NaN                 NaN       NaN        NaN      NaN\n",
       "30939    26548831  10/24/2013 09:35:10 AM                     NaN    DCA  Department of Consumer Affairs  Consumer Complaint         Harassment           NaN   55164-0737          P.O. BOX 64437               64437            NaN            NaN                   NaN                   NaN          NaN    ST. PAUL      NaN           NaN  Assigned  11/13/2013 02:30:21 PM         10/29/2013 02:31:06 PM   0 Unspecified  Unspecified                         NaN                         NaN        Unspecified  Unspecified  Unspecified   Unspecified   Unspecified  Unspecified         Unspecified    Unspecified  Unspecified  Unspecified  Unspecified                N                           NaN          NaN                  NaN                   NaN                 NaN                      NaN       NaN                    NaN             NaN             NaN                 NaN       NaN        NaN      NaN\n",
       "70539    26488417  10/15/2013 03:40:33 PM                     NaN    TLC   Taxi and Limousine Commission      Taxi Complaint   Driver Complaint        Street   11549-3650  365 HOFSTRA UNIVERSITY  HOFSTRA UNIVERSITY            NaN            NaN                   NaN                   NaN          NaN    HEMSTEAD      NaN           NaN  Assigned  11/30/2013 01:20:33 PM         10/16/2013 01:21:39 PM   0 Unspecified  Unspecified                         NaN                         NaN        Unspecified  Unspecified  Unspecified   Unspecified   Unspecified  Unspecified         Unspecified    Unspecified  Unspecified  Unspecified  Unspecified                N                           NaN          NaN                  NaN    La Guardia Airport                 NaN                      NaN       NaN                    NaN             NaN             NaN                 NaN       NaN        NaN      NaN\n",
       "85821    26468296  10/10/2013 12:36:43 PM  10/26/2013 01:07:07 AM    DCA  Department of Consumer Affairs  Consumer Complaint      Debt Not Owed           NaN   29616-0759            PO BOX 25759           BOX 25759            NaN            NaN                   NaN                   NaN          NaN  GREENVILLE      NaN           NaN    Closed  10/26/2013 09:20:28 AM         10/26/2013 01:07:07 AM   0 Unspecified  Unspecified                         NaN                         NaN        Unspecified  Unspecified  Unspecified   Unspecified   Unspecified  Unspecified         Unspecified    Unspecified  Unspecified  Unspecified  Unspecified                N                           NaN          NaN                  NaN                   NaN                 NaN                      NaN       NaN                    NaN             NaN             NaN                 NaN       NaN        NaN      NaN\n",
       "89304    26461137  10/09/2013 05:23:46 PM  10/25/2013 01:06:41 AM    DCA  Department of Consumer Affairs  Consumer Complaint         Harassment           NaN   35209-3114         600 BEACON PKWY         BEACON PKWY            NaN            NaN                   NaN                   NaN          NaN  BIRMINGHAM      NaN           NaN    Closed  10/25/2013 02:43:42 PM         10/25/2013 01:06:41 AM   0 Unspecified  Unspecified                         NaN                         NaN        Unspecified  Unspecified  Unspecified   Unspecified   Unspecified  Unspecified         Unspecified    Unspecified  Unspecified  Unspecified  Unspecified                N                           NaN          NaN                  NaN                   NaN                 NaN                      NaN       NaN                    NaN             NaN             NaN                 NaN       NaN        NaN      NaN"
      ]
     },
     "execution_count": 119,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "requests[rows_with_dashes]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I thought these were missing data and originally deleted them like this:\n",
    "\n",
    "`requests['Incident Zip'][rows_with_dashes] = np.nan`\n",
    "\n",
    "But then my friend Dave pointed out that 9-digit zip codes are normal. Let's look at all the zip codes with more than 5 digits, make sure they're okay, and then truncate them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['77092-2016', '55164-0737', '000000', '11549-3650', '29616-0759',\n",
       "       '35209-3114'], dtype=object)"
      ]
     },
     "execution_count": 120,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "long_zip_codes = requests['Incident Zip'].str.len() > 5\n",
    "requests['Incident Zip'][long_zip_codes].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those all look okay to truncate to me."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [],
   "source": [
    "requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Done."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Earlier I thought 00083 was a broken zip code, but turns out Central Park's zip code 00083! Shows what I know. I'm still concerned about the 00000 zip codes, though: let's look at that. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unique Key</th>\n",
       "      <th>Created Date</th>\n",
       "      <th>Closed Date</th>\n",
       "      <th>Agency</th>\n",
       "      <th>Agency Name</th>\n",
       "      <th>Complaint Type</th>\n",
       "      <th>Descriptor</th>\n",
       "      <th>Location Type</th>\n",
       "      <th>Incident Zip</th>\n",
       "      <th>Incident Address</th>\n",
       "      <th>Street Name</th>\n",
       "      <th>Cross Street 1</th>\n",
       "      <th>Cross Street 2</th>\n",
       "      <th>Intersection Street 1</th>\n",
       "      <th>Intersection Street 2</th>\n",
       "      <th>Address Type</th>\n",
       "      <th>City</th>\n",
       "      <th>Landmark</th>\n",
       "      <th>Facility Type</th>\n",
       "      <th>Status</th>\n",
       "      <th>Due Date</th>\n",
       "      <th>Resolution Action Updated Date</th>\n",
       "      <th>Community Board</th>\n",
       "      <th>Borough</th>\n",
       "      <th>X Coordinate (State Plane)</th>\n",
       "      <th>Y Coordinate (State Plane)</th>\n",
       "      <th>Park Facility Name</th>\n",
       "      <th>Park Borough</th>\n",
       "      <th>School Name</th>\n",
       "      <th>School Number</th>\n",
       "      <th>School Region</th>\n",
       "      <th>School Code</th>\n",
       "      <th>School Phone Number</th>\n",
       "      <th>School Address</th>\n",
       "      <th>School City</th>\n",
       "      <th>School State</th>\n",
       "      <th>School Zip</th>\n",
       "      <th>School Not Found</th>\n",
       "      <th>School or Citywide Complaint</th>\n",
       "      <th>Vehicle Type</th>\n",
       "      <th>Taxi Company Borough</th>\n",
       "      <th>Taxi Pick Up Location</th>\n",
       "      <th>Bridge Highway Name</th>\n",
       "      <th>Bridge Highway Direction</th>\n",
       "      <th>Road Ramp</th>\n",
       "      <th>Bridge Highway Segment</th>\n",
       "      <th>Garage Lot Name</th>\n",
       "      <th>Ferry Direction</th>\n",
       "      <th>Ferry Terminal Name</th>\n",
       "      <th>Latitude</th>\n",
       "      <th>Longitude</th>\n",
       "      <th>Location</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>42600</th>\n",
       "      <td>26529313</td>\n",
       "      <td>10/22/2013 02:51:06 PM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>TLC</td>\n",
       "      <td>Taxi and Limousine Commission</td>\n",
       "      <td>Taxi Complaint</td>\n",
       "      <td>Driver Complaint</td>\n",
       "      <td>NaN</td>\n",
       "      <td>00000</td>\n",
       "      <td>EWR EWR</td>\n",
       "      <td>EWR</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NEWARK</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Assigned</td>\n",
       "      <td>12/07/2013 09:53:51 AM</td>\n",
       "      <td>10/23/2013 09:54:43 AM</td>\n",
       "      <td>0 Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Other</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>60843</th>\n",
       "      <td>26507389</td>\n",
       "      <td>10/17/2013 05:48:44 PM</td>\n",
       "      <td>NaN</td>\n",
       "      <td>TLC</td>\n",
       "      <td>Taxi and Limousine Commission</td>\n",
       "      <td>Taxi Complaint</td>\n",
       "      <td>Driver Complaint</td>\n",
       "      <td>Street</td>\n",
       "      <td>00000</td>\n",
       "      <td>1 NEWARK AIRPORT</td>\n",
       "      <td>NEWARK AIRPORT</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NEWARK</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Assigned</td>\n",
       "      <td>12/02/2013 11:59:46 AM</td>\n",
       "      <td>10/18/2013 12:01:08 PM</td>\n",
       "      <td>0 Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>Unspecified</td>\n",
       "      <td>N</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Other</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Unique Key            Created Date Closed Date Agency                    Agency Name  Complaint Type        Descriptor Location Type Incident Zip  Incident Address     Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type    City Landmark Facility Type    Status                Due Date Resolution Action Updated Date Community Board      Borough  X Coordinate (State Plane)  Y Coordinate (State Plane) Park Facility Name Park Borough  School Name School Number School Region  School Code School Phone Number School Address  School City School State   School Zip School Not Found  School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name  Latitude  Longitude Location\n",
       "42600    26529313  10/22/2013 02:51:06 PM         NaN    TLC  Taxi and Limousine Commission  Taxi Complaint  Driver Complaint           NaN        00000           EWR EWR             EWR            NaN            NaN                   NaN                   NaN          NaN  NEWARK      NaN           NaN  Assigned  12/07/2013 09:53:51 AM         10/23/2013 09:54:43 AM   0 Unspecified  Unspecified                         NaN                         NaN        Unspecified  Unspecified  Unspecified   Unspecified   Unspecified  Unspecified         Unspecified    Unspecified  Unspecified  Unspecified  Unspecified                N                           NaN          NaN                  NaN                 Other                 NaN                      NaN       NaN                    NaN             NaN             NaN                 NaN       NaN        NaN      NaN\n",
       "60843    26507389  10/17/2013 05:48:44 PM         NaN    TLC  Taxi and Limousine Commission  Taxi Complaint  Driver Complaint        Street        00000  1 NEWARK AIRPORT  NEWARK AIRPORT            NaN            NaN                   NaN                   NaN          NaN  NEWARK      NaN           NaN  Assigned  12/02/2013 11:59:46 AM         10/18/2013 12:01:08 PM   0 Unspecified  Unspecified                         NaN                         NaN        Unspecified  Unspecified  Unspecified   Unspecified   Unspecified  Unspecified         Unspecified    Unspecified  Unspecified  Unspecified  Unspecified                N                           NaN          NaN                  NaN                 Other                 NaN                      NaN       NaN                    NaN             NaN             NaN                 NaN       NaN        NaN      NaN"
      ]
     },
     "execution_count": 122,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "requests[requests['Incident Zip'] == '00000']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks bad to me. Let's set these to nan."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [],
   "source": [
    "zero_zips = requests['Incident Zip'] == '00000'\n",
    "requests.loc[zero_zips, 'Incident Zip'] = np.nan"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great. Let's see where we are now:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "'<' not supported between instances of 'float' and 'str'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-124-a216f82ef513>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0munique_zips\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrequests\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Incident Zip'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0munique_zips\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msort\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      3\u001b[0m \u001b[0munique_zips\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mTypeError\u001b[0m: '<' not supported between instances of 'float' and 'str'"
     ]
    }
   ],
   "source": [
    "unique_zips = requests['Incident Zip'].unique()\n",
    "unique_zips.sort()\n",
    "unique_zips"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Amazing! This is much cleaner. There's something a bit weird here, though -- I looked up 77056 on Google maps, and that's in Texas.\n",
    "\n",
    "Let's take a closer look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [],
   "source": [
    "zips = requests['Incident Zip']\n",
    "# Let's say the zips starting with '0' and '1' are okay, for now. (this isn't actually true -- 13221 is in Syracuse, and why?)\n",
    "is_close = zips.str.startswith('0') | zips.str.startswith('1')\n",
    "# There are a bunch of NaNs, but we're not interested in them right now, so we'll say they're False\n",
    "is_far = ~(is_close) & zips.notnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "12102    77056\n",
       "13450    70711\n",
       "29136    77092\n",
       "30939    55164\n",
       "44008    90010\n",
       "47048    23541\n",
       "57636    92123\n",
       "71001    92123\n",
       "71834    23502\n",
       "80573    61702\n",
       "85821    29616\n",
       "89304    35209\n",
       "94201    41042\n",
       "Name: Incident Zip, dtype: object"
      ]
     },
     "execution_count": 126,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "zips[is_far]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Incident Zip</th>\n",
       "      <th>Descriptor</th>\n",
       "      <th>City</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>71834</th>\n",
       "      <td>23502</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>NORFOLK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47048</th>\n",
       "      <td>23541</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>NORFOLK</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85821</th>\n",
       "      <td>29616</td>\n",
       "      <td>Debt Not Owed</td>\n",
       "      <td>GREENVILLE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>89304</th>\n",
       "      <td>35209</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>BIRMINGHAM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>94201</th>\n",
       "      <td>41042</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>FLORENCE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30939</th>\n",
       "      <td>55164</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>ST. PAUL</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>80573</th>\n",
       "      <td>61702</td>\n",
       "      <td>Billing Dispute</td>\n",
       "      <td>BLOOMIGTON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13450</th>\n",
       "      <td>70711</td>\n",
       "      <td>Contract Dispute</td>\n",
       "      <td>CLIFTON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12102</th>\n",
       "      <td>77056</td>\n",
       "      <td>Debt Not Owed</td>\n",
       "      <td>HOUSTON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29136</th>\n",
       "      <td>77092</td>\n",
       "      <td>False Advertising</td>\n",
       "      <td>HOUSTON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44008</th>\n",
       "      <td>90010</td>\n",
       "      <td>Billing Dispute</td>\n",
       "      <td>LOS ANGELES</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>57636</th>\n",
       "      <td>92123</td>\n",
       "      <td>Harassment</td>\n",
       "      <td>SAN DIEGO</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71001</th>\n",
       "      <td>92123</td>\n",
       "      <td>Billing Dispute</td>\n",
       "      <td>SAN DIEGO</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      Incident Zip         Descriptor         City\n",
       "71834        23502         Harassment      NORFOLK\n",
       "47048        23541         Harassment      NORFOLK\n",
       "85821        29616      Debt Not Owed   GREENVILLE\n",
       "89304        35209         Harassment   BIRMINGHAM\n",
       "94201        41042         Harassment     FLORENCE\n",
       "30939        55164         Harassment     ST. PAUL\n",
       "80573        61702    Billing Dispute   BLOOMIGTON\n",
       "13450        70711   Contract Dispute      CLIFTON\n",
       "12102        77056      Debt Not Owed      HOUSTON\n",
       "29136        77092  False Advertising      HOUSTON\n",
       "44008        90010    Billing Dispute  LOS ANGELES\n",
       "57636        92123         Harassment    SAN DIEGO\n",
       "71001        92123    Billing Dispute    SAN DIEGO"
      ]
     },
     "execution_count": 127,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "requests[is_far][['Incident Zip', 'Descriptor', 'City']].sort_values('Incident Zip')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, there really are requests coming from LA and Houston! Good to know. Filtering by zip code is probably a bad way to handle this -- we should really be looking at the city instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "BROOKLYN                  31662\n",
       "NEW YORK                  22664\n",
       "BRONX                     18438\n",
       "STATEN ISLAND              4766\n",
       "JAMAICA                    2246\n",
       "                          ...  \n",
       "BRIARWOOD                     1\n",
       "PORT JEFFERSON STATION        1\n",
       "NJ                            1\n",
       "ROSELYN                       1\n",
       "RYEBROOK                      1\n",
       "Name: City, Length: 100, dtype: int64"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "requests['City'].str.upper().value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like these are legitimate complaints, so we'll just leave them alone."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7.4 Putting it together"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's what we ended up doing to clean up our zip codes, all together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [],
   "source": [
    "na_values = ['NO CLUE', 'N/A', '0']\n",
    "requests = pd.read_csv('../data/311-service-requests.csv', \n",
    "                       na_values=na_values, \n",
    "                       dtype={'Incident Zip': str})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fix_zip_codes(zips):\n",
    "    # Truncate everything to length 5 \n",
    "    zips = zips.str.slice(0, 5)\n",
    "    \n",
    "    # Set 00000 zip codes to nan\n",
    "    zero_zips = zips == '00000'\n",
    "    zips[zero_zips] = np.nan\n",
    "    \n",
    "    return zips"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {},
   "outputs": [],
   "source": [
    "requests['Incident Zip'] = fix_zip_codes(requests['Incident Zip'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['11432', '11378', '10032', '10023', '10027', '11372', '11419',\n",
       "       '11417', '10011', '11225', '11218', '10003', '10029', '10466',\n",
       "       '11219', '10025', '10310', '11236', nan, '10033', '11216', '10016',\n",
       "       '10305', '10312', '10026', '10309', '10036', '11433', '11235',\n",
       "       '11213', '11379', '11101', '10014', '11231', '11234', '10457',\n",
       "       '10459', '10465', '11207', '10002', '10034', '11233', '10453',\n",
       "       '10456', '10469', '11374', '11221', '11421', '11215', '10007',\n",
       "       '10019', '11205', '11418', '11369', '11249', '10005', '10009',\n",
       "       '11211', '11412', '10458', '11229', '10065', '10030', '11222',\n",
       "       '10024', '10013', '11420', '11365', '10012', '11214', '11212',\n",
       "       '10022', '11232', '11040', '11226', '10281', '11102', '11208',\n",
       "       '10001', '10472', '11414', '11223', '10040', '11220', '11373',\n",
       "       '11203', '11691', '11356', '10017', '10452', '10280', '11217',\n",
       "       '10031', '11201', '11358', '10128', '11423', '10039', '10010',\n",
       "       '11209', '10021', '10037', '11413', '11375', '11238', '10473',\n",
       "       '11103', '11354', '11361', '11106', '11385', '10463', '10467',\n",
       "       '11204', '11237', '11377', '11364', '11434', '11435', '11210',\n",
       "       '11228', '11368', '11694', '10464', '11415', '10314', '10301',\n",
       "       '10018', '10038', '11105', '11230', '10468', '11104', '10471',\n",
       "       '11416', '10075', '11422', '11355', '10028', '10462', '10306',\n",
       "       '10461', '11224', '11429', '10035', '11366', '11362', '11206',\n",
       "       '10460', '10304', '11360', '11411', '10455', '10475', '10069',\n",
       "       '10303', '10308', '10302', '11357', '10470', '11367', '11370',\n",
       "       '10454', '10451', '11436', '11426', '10153', '11004', '11428',\n",
       "       '11427', '11001', '11363', '10004', '10474', '11430', '10000',\n",
       "       '10307', '11239', '10119', '10006', '10048', '11697', '11692',\n",
       "       '11693', '10573', '00083', '11559', '10020', '77056', '11776',\n",
       "       '70711', '10282', '11109', '10044', '02061', '77092', '14225',\n",
       "       '55164', '19711', '07306', '90010', '11747', '23541', '11788',\n",
       "       '07604', '10112', '11563', '11580', '07087', '11042', '07093',\n",
       "       '11501', '92123', '11575', '07109', '11797', '10803', '11716',\n",
       "       '11722', '11549', '10162', '23502', '11518', '07020', '08807',\n",
       "       '11577', '07114', '11003', '07201', '61702', '10103', '29616',\n",
       "       '35209', '11520', '11735', '10129', '11005', '41042', '11590',\n",
       "       '06901', '07208', '11530', '13221', '10954', '11111', '10107'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 132,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "requests['Incident Zip'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<style>\n",
    "    @font-face {\n",
    "        font-family: \"Computer Modern\";\n",
    "        src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
    "    }\n",
    "    div.cell{\n",
    "        width:800px;\n",
    "        margin-left:16% !important;\n",
    "        margin-right:auto;\n",
    "    }\n",
    "    h1 {\n",
    "        font-family: Helvetica, serif;\n",
    "    }\n",
    "    h4{\n",
    "        margin-top:12px;\n",
    "        margin-bottom: 3px;\n",
    "       }\n",
    "    div.text_cell_render{\n",
    "        font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
    "        line-height: 145%;\n",
    "        font-size: 130%;\n",
    "        width:800px;\n",
    "        margin-left:auto;\n",
    "        margin-right:auto;\n",
    "    }\n",
    "    .CodeMirror{\n",
    "            font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
    "    }\n",
    "    .text_cell_render h5 {\n",
    "        font-weight: 300;\n",
    "        font-size: 22pt;\n",
    "        color: #4057A1;\n",
    "        font-style: italic;\n",
    "        margin-bottom: .5em;\n",
    "        margin-top: 0.5em;\n",
    "        display: block;\n",
    "    }\n",
    "    \n",
    "    .warning{\n",
    "        color: rgb( 240, 20, 20 )\n",
    "        }  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
